This is an exam. You cannot discuss it with anybody; you must do the work yourself. Questions 1, 2, and 3 can be done with pen and paper. If you choose to do so, you can submit the scans along with the other material (you can also embed them in your knitted PDF).

The submitted files must include PDFs with your answers along with all R scripts. For example:

No PDF report, no grade. If you experience difficulties with knitting, combine your answers in Word or any other editor and produce a PDF file for grading.

No R scripts: 50% reduction in grade if the relevant code is present in the PDF report, 100% reduction if no such code is present.

Late submissions will result in a points reduction: 10% for the first hour, 50% for the second hour, 100% after six hours. Only a valid and documented reason for missing the exam or being late is acceptable. Sleeping in, lack of preparation, ennui, grogginess, inability to knit, etc. are not acceptable excuses. University policy allows multiple midterm exams on the same day.

Reports longer than 25 pages are not going to be graded.

The exam must be submitted through UBLearns by 11:59 PM on October 14, 2022.

Question 1. Clustering with Pen and Paper (25 Points).

Consider the following 6 points:

| point | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|
| x     | 1 | 4 | 3 | 4 | 3 | 5 |
| y     | 1 | 1 | 4 | 5 | 7 | 7 |

Q1.1 Perform a single k-means clustering with Manhattan distance using pen and paper. Show your work in a readable manner. (20 points)

You can use rmarkdown instead of paper, a simple calculator, or simple vector calculations in R.

You can set points as p1 <- c(1,1), p2 <- c(4,1) and use regular math operations on them (+, -, *, /). You can also use math functions like sum, abs, and so on, and output functions like print and cat.

You will need a random sample without replacement, either of size two or size six; use the one generated below:

5 6 3 4 1 2
hint: if you used pen and paper, you can take a photo of it and add it to the Rmd using the following markdown command:
![Note](filename.jpeg)
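As a sanity check of the hand computation, here is a minimal R sketch of k-means with Manhattan (L1) distance in the assignment step. It assumes the first two indices of the generated sample (points 5 and 6) are the initial centers, and uses coordinate-wise means in the update step (as the exam's hand procedure does); a strict k-medians variant would use coordinate-wise medians instead.

```r
# Points from the table in Question 1
pts <- rbind(c(1, 1), c(4, 1), c(3, 4), c(4, 5), c(3, 7), c(5, 7))

# Manhattan (L1) distance between two points
manhattan <- function(a, b) sum(abs(a - b))

# Initial centers: points 5 and 6, per the generated sample above
centers <- pts[c(5, 6), ]

repeat {
  # Assignment step: each point goes to the nearest center in L1 distance
  assign <- apply(pts, 1, function(p) {
    which.min(apply(centers, 1, function(ctr) manhattan(p, ctr)))
  })
  # Update step: coordinate-wise mean of each cluster
  new_centers <- rbind(colMeans(pts[assign == 1, , drop = FALSE]),
                       colMeans(pts[assign == 2, , drop = FALSE]))
  if (all(new_centers == centers)) break  # centers stopped moving: converged
  centers <- new_centers
}
print(assign)
print(centers)
```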

Answer to 1.1

Q1.2 One of the students decided to check his work and performed the following clustering with R (5 points):

That is, the student found that the first two points belong to the first cluster and the other 4 points to the second cluster. But the student's results in Q1.1 are different. Can you explain why?

hint: in general there are two reasons, besides the student messing up in Q1.1 (assume the student's Q1.1 work is correct). Of those two reasons, only one actually has an effect in this small example, though the other can also play a role if it is not handled properly.

# (the call that produced `clus` is not shown; presumably something like
#  set.seed(1); clus <- kmeans(data.frame(x, y), centers = 2))
clus$cluster
## [1] 1 1 2 2 2 2
print(clus$centers)
##      x    y
## 1 2.50 1.00
## 2 3.75 5.75

The centroids from the kmeans run differ from our hand-computed centers, as we can see above. Our centroids were the means of the first three and the last three points: roughly (2.67, 2) and (4, 6.33). The centroids from the kmeans run with seed 1 are (2.50, 1.00) and (3.75, 5.75). The results differ for two reasons: 1. R's kmeans minimizes within-cluster squared Euclidean distance, not the Manhattan distance we used in the manual approach. 2. The random initialization of the starting centers (controlled by the seed) can change which solution the algorithm converges to.
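A quick check that the centers reported by kmeans are simply the coordinate-wise means of the points it assigned to each cluster (assignment 1 1 2 2 2 2 shown above):

```r
x <- c(1, 4, 3, 4, 3, 5)
y <- c(1, 1, 4, 5, 7, 7)

# Means of points 1-2 (cluster 1) and points 3-6 (cluster 2)
c1 <- c(mean(x[1:2]), mean(y[1:2]))  # matches the reported (2.50, 1.00)
c2 <- c(mean(x[3:6]), mean(y[3:6]))  # matches the reported (3.75, 5.75)
print(c1)
print(c2)
```

So kmeans computed its centers correctly for its own assignment; the disagreement with Q1.1 comes from the clustering itself, not from the centroid formula.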

Question 2 (25 Points)

Q2.1 Just like Q1.1, but this time perform hierarchical clustering with complete linkage on the same data. (20 points)

After building the whole tree, don't forget to cut it to get two clusters. Draw the dendrogram by hand (show proper heights) or using ASCII graphics.

Answer to 2.1, Page 1

Answer to 2.1, Page 2

Q2.2 Do you expect hclust to give the same answer as yours? Check it. (5 points)

x <- c(1,4,3,4,3,5)
y <- c(1,1,4,5,7,7)
df_0 <- data.frame(x, y)
#df_0_sc <- as.data.frame(scale(df_0))
edist_mat <- dist(df_0, method = 'euclidean')
hclust_com <- hclust(edist_mat, method = 'complete')
plot(hclust_com)
rect.hclust(hclust_com, k = 2, border = 2:3)  # cut into the two requested clusters
abline(h = 5, col = 'red')                    # a horizontal cut at this height gives two clusters

The above plot matches our manual dendrogram, so hclust gives the same answer as our manual calculation.
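The two-cluster membership can also be read off programmatically with cutree, rather than visually from the rectangles; a small self-contained sketch on the same tree:

```r
x <- c(1, 4, 3, 4, 3, 5)
y <- c(1, 1, 4, 5, 7, 7)

# Rebuild the complete-linkage tree and cut it into two clusters
hc <- hclust(dist(data.frame(x, y), method = "euclidean"),
             method = "complete")
labels2 <- cutree(hc, k = 2)
print(labels2)  # points 1-2 form one cluster, points 3-6 the other
```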

Question 3 (10 points)

Compare and contrast k-means and hierarchical clustering in terms of input, output, speed, and cluster characteristics / ability to separate manifold structures.

K-Means vs Hierarchical clustering

Question 4 (40 points)

Clustering is often used to reexamine existing classifications. Sometimes misclassification occurs in the original design: for example, one class might consist of two sufficiently different groups, or two classes might be essentially the same.

In this exercise, you are presented with a seeds dataset containing measurements from multiple wheat subspecies. Your task is to identify the number of subspecies.

seeds.csv consists of 200 records and reports 7 measurements for each seed:

Q4.1 Read and Visualize Dataset (5 points)

# read dataset
library(data.table)  # for fread
df = fread("seeds.csv", stringsAsFactors = FALSE)
print(c("na values:",sum(is.na(df))))
## [1] "na values:" "0"
print(c("null values:",sum(is.null(df))))
## [1] "null values:" "0"
head(df)
summary(df)
##      length          width         asymmetry          groove     
##  Min.   :4.899   Min.   :2.630   Min.   :0.7651   Min.   :4.519  
##  1st Qu.:5.263   1st Qu.:2.930   1st Qu.:2.6183   1st Qu.:5.045  
##  Median :5.524   Median :3.237   Median :3.6250   Median :5.223  
##  Mean   :5.631   Mean   :3.258   Mean   :3.7143   Mean   :5.409  
##  3rd Qu.:6.000   3rd Qu.:3.562   3rd Qu.:4.7860   3rd Qu.:5.878  
##  Max.   :6.675   Max.   :4.033   Max.   :8.4560   Max.   :6.550  
##       area         perimeter      compactness    
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081  
##  1st Qu.:12.25   1st Qu.:13.45   1st Qu.:0.8566  
##  Median :14.34   Median :14.29   Median :0.8734  
##  Mean   :14.85   Mean   :14.56   Mean   :0.8707  
##  3rd Qu.:17.33   3rd Qu.:15.74   3rd Qu.:0.8875  
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183
# Examine dataset, for example by plotting it with GGally::ggpairs
library(GGally)  # for ggpairs
data <- dplyr::select(df, length, width, asymmetry, groove, area, perimeter, compactness)
ggpairs(data)

Comment on observations: is there any distinct clustering?

Data that forms hyper-spherical clusters is well suited to k-means, so from the pairs plot we can tell that the following variable pairs look promising for k-means clustering:

- length vs asymmetry: 3 clusters visible
- width vs asymmetry: 3 clusters visible
- asymmetry vs groove: 2 clusters visible
- asymmetry vs area: 2 clusters visible
- asymmetry vs perimeter: 2 clusters visible
- asymmetry vs compactness: 2 clusters visible

Q4.2 Run PCA (8 Points)

Q4.2.1 Run PCA (don't forget to scale the data); save the result as pc_out. We will use pc_out$x[,1] and pc_out$x[,2] later for plotting.

df_scale <- as.data.frame(scale(df))
head(df_scale)  # avoid printing all 200 scaled rows in the report
pc_out <- prcomp(df_scale)
pc_out
## Standard deviations (1, .., p=7):
## [1] 2.24824359 1.08840367 0.81857786 0.25782495 0.13444617 0.07312961 0.02847917
## 
## Rotation (n x k) = (7 x 7):
##                    PC1         PC2         PC3         PC4         PC5
## length       0.4224390  0.21054814 -0.20929501  0.27957654 -0.76376099
## width        0.4321888 -0.11334615  0.21744663  0.19127128  0.46128464
## asymmetry   -0.1313456  0.71254540  0.68090375  0.09927373 -0.03360331
## groove       0.3856353  0.38601837 -0.20648132 -0.80470471  0.09474757
## area         0.4434049  0.03053263  0.02791147  0.19013824  0.21222821
## perimeter    0.4405006  0.08739006 -0.05756447  0.29323567  0.18723595
## compactness  0.2795410 -0.52680332  0.63131244 -0.32512704 -0.33716649
##                     PC6          PC7
## length      -0.26464436  0.043357115
## width       -0.70844688 -0.042629661
## asymmetry    0.02000234 -0.003594224
## groove      -0.04333067 -0.035161796
## area         0.41929322  0.738024799
## perimeter    0.47830958 -0.667221101
## compactness  0.14560864 -0.072034718

Q4.2.2 Make a scree plot / percentage-variance-explained plot. Comment on the percentage of variance explained (will the first two components cover enough variance in the dataset?)

#screeplot(pc_out, type = "line", main = "Scree plot")
exp_var = 100 * pc_out$sdev^2 / sum(pc_out$sdev^2)
print(exp_var)
## [1] 72.20856085 16.92317912  9.57242456  0.94962436  0.25822534  0.07639915
## [7]  0.01158662
# Scree plot
plot(exp_var, xlab = "Principal Components",
    ylab = "Explained variance",
     type = "b")

The cumulative variance explained by PC1 and PC2 is about 89% (72.2% + 16.9%), well above 80%, which is enough for our analysis.
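The cumulative percentages can be confirmed directly from the standard deviations printed by prcomp above:

```r
# Standard deviations reported by prcomp on the scaled data
sdev <- c(2.24824359, 1.08840367, 0.81857786, 0.25782495,
          0.13444617, 0.07312961, 0.02847917)

# Percentage of variance explained by each PC, and its running total
exp_var <- 100 * sdev^2 / sum(sdev^2)
print(cumsum(exp_var))  # the second entry is the PC1+PC2 total, about 89%
```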

Q4.2.3 Make a biplot. Comment on the biplot (clusters? possible meaning of the PCs?)

library(plotly)    # for ggplotly
library(ggbiplot)  # for ggbiplot
ggplotly(ggbiplot(pc_out, scale = 1))

We can observe that PC1 explains 72% of the variance and PC2 explains 16.9%, which together give more than 80% of the variance explained. Further, the loading vectors form three groups: one dominated by compactness, one by asymmetry, and a third containing all the other variables, which point in roughly the same direction.

Q4.3 Perform clustering of your choice. Select the number of clusters. (20 points)

It looks like k-means should do the job; the biplot has a nicely packed shape.

#elbow method
#compute from k=2 to k=20
#k.max <- 20
#data <- df_scale
#wss <- sapply(1:k.max, 
#              function(k){kmeans(data, k, nstart=50,iter.max = 20 )$tot.withinss})
#wss
#plot(1:k.max, wss,
#     type="b", pch = 19, frame = FALSE, 
#     xlab="Number of clusters K",
#     ylab="Total within-clusters sum of squares")
library(factoextra)  # for fviz_nbclust
fviz_nbclust(df_scale, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

#Gap statistics
#gap_kmeans <- clusGap(df_scale, kmeans, nstart = 20, K.max = 10, B = 100)
#plot(gap_kmeans, main = "Gap Statistic: kmeans")
set.seed(72)
fviz_nbclust(df_scale, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

#silhouette analysis
set.seed(72)
fviz_nbclust(df_scale, kmeans, method='silhouette')+
  labs(subtitle = "Silhouette method")

library("NbClust")
nb <- NbClust(df_scale, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 13 proposed 3 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
fviz_nbclust(nb)
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 13 proposed  3 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

km_3 <- kmeans(df_scale, 3, nstart = 25)
fviz_cluster(km_3, data = df_scale) + ggtitle("k=3")

From our analysis, the elbow, gap-statistic, and silhouette methods suggest k = 4, 3, and 2 respectively, so there is no majority among the three, and k = 3 is the middle value. I therefore ran the NbClust battery of about 30 indices, of which 13 proposed k = 3, a clear majority. Hence we use k = 3 as the ideal value.

Q4.4 Characterize clusters. (5 points)

library(psych)  # for describeBy
df_scale$Cluster <- km_3$cluster
head(df_scale)
describeBy(df_scale, group = "Cluster")
## 
##  Descriptive statistics by group 
## Cluster: 1
##             vars  n  mean   sd median trimmed  mad   min   max range  skew
## length         1 64 -0.88 0.32  -0.89   -0.88 0.33 -1.64 -0.20  1.44 -0.03
## width          2 64 -1.11 0.35  -1.15   -1.13 0.41 -1.65 -0.07  1.58  0.56
## asymmetry      3 64  0.81 0.81   0.76    0.75 0.72 -0.68  3.16  3.84  0.75
## groove         4 64 -0.58 0.34  -0.64   -0.58 0.28 -1.62  0.17  1.79 -0.21
## area           5 64 -1.04 0.24  -1.05   -1.04 0.28 -1.45 -0.52  0.94  0.31
## perimeter      6 64 -1.01 0.27  -1.02   -1.01 0.30 -1.64 -0.47  1.17  0.04
## compactness    7 64 -1.02 0.81  -0.96   -1.01 0.79 -2.64  0.74  3.38 -0.11
## Cluster        8 64  1.00 0.00   1.00    1.00 0.00  1.00  1.00  0.00   NaN
##             kurtosis   se
## length         -0.65 0.04
## width          -0.30 0.04
## asymmetry       0.47 0.10
## groove          0.12 0.04
## area           -0.97 0.03
## perimeter      -0.83 0.03
## compactness    -0.59 0.10
## Cluster          NaN 0.00
## ------------------------------------------------------------ 
## Cluster: 2
##             vars  n  mean   sd median trimmed  mad   min  max range  skew
## length         1 65  1.22 0.54   1.17    1.21 0.52  0.19 2.34  2.15  0.18
## width          2 65  1.15 0.44   1.15    1.15 0.47  0.34 2.04  1.70 -0.10
## asymmetry      3 65 -0.08 0.78  -0.06   -0.10 0.77 -1.49 1.52  3.02  0.16
## groove         4 65  1.28 0.47   1.21    1.27 0.52  0.15 2.31  2.15  0.16
## area           5 65  1.24 0.44   1.33    1.24 0.38  0.24 2.16  1.92 -0.16
## perimeter      6 65  1.25 0.42   1.27    1.25 0.42  0.25 2.04  1.80 -0.13
## compactness    7 65  0.56 0.62   0.51    0.57 0.70 -1.08 1.69  2.77 -0.17
## Cluster        8 65  2.00 0.00   2.00    2.00 0.00  2.00 2.00  0.00   NaN
##             kurtosis   se
## length         -0.76 0.07
## width          -0.92 0.05
## asymmetry      -0.92 0.10
## groove         -0.54 0.06
## area           -0.56 0.05
## perimeter      -0.64 0.05
## compactness    -0.65 0.08
## Cluster          NaN 0.00
## ------------------------------------------------------------ 
## Cluster: 3
##             vars  n  mean   sd median trimmed  mad   min  max range  skew
## length         1 71 -0.33 0.53  -0.26   -0.31 0.53 -1.63 0.65  2.28 -0.34
## width          2 71 -0.05 0.44  -0.07   -0.04 0.41 -1.00 0.85  1.85 -0.08
## asymmetry      3 71 -0.66 0.81  -0.72   -0.72 0.76 -1.97 1.98  3.95  0.75
## groove         4 71 -0.65 0.52  -0.64   -0.67 0.40 -1.80 0.95  2.75  0.39
## area           5 71 -0.20 0.40  -0.18   -0.19 0.42 -1.24 0.54  1.78 -0.33
## perimeter      6 71 -0.23 0.44  -0.21   -0.22 0.50 -1.47 0.54  2.01 -0.40
## compactness    7 71  0.41 0.69   0.44    0.42 0.70 -1.33 2.01  3.33 -0.08
## Cluster        8 71  3.00 0.00   3.00    3.00 0.00  3.00 3.00  0.00   NaN
##             kurtosis   se
## length         -0.57 0.06
## width          -0.66 0.05
## asymmetry       0.46 0.10
## groove          0.38 0.06
## area           -0.56 0.05
## perimeter      -0.40 0.05
## compactness    -0.32 0.08
## Cluster          NaN 0.00
library(plotly)  # for plot_ly
char_data <- cbind(data, clusters = km_3$cluster)
char_data$clusters <- as.factor(char_data$clusters)
plot_ly(char_data, y = ~length, color = ~clusters, type = "box") %>% layout(title = "Length in each cluster")
plot_ly(char_data, y = ~width, color = ~clusters, type = "box") %>% layout(title = "Width in each cluster")
plot_ly(char_data, y = ~area, color = ~clusters, type = "box") %>% layout(title = "Area in each cluster")
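The per-cluster summaries above can also be collected into a single table of cluster means with aggregate. A self-contained toy sketch (hypothetical 2-cluster data standing in for the seeds measurements, not the actual dataset):

```r
# Toy data: two measurements and a hypothetical cluster label
toy <- data.frame(length  = c(5.0, 5.1, 6.2, 6.3),
                  area    = c(11, 12, 20, 21),
                  cluster = c(1, 1, 2, 2))

# Mean of every measurement within each cluster, one row per cluster
means_by_cluster <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(means_by_cluster)
```

On the real data the same call, e.g. `aggregate(. ~ clusters, data = char_data, FUN = mean)`, gives a compact characterization of each cluster.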

Q4.5 How many subspecies do you think are in the set? (Does it match the conclusion from Q4.3?) (2 points)

From the above analysis we can tell that there are 3 subspecies in the given dataset, with the major differences in length and area. This matches our expectations from the biplot and cluster-number analysis in Q4.3.